Abstract

Stochastic Gradient Descent (SGD) is one of the most widely used techniques
for online optimization in machine learning. In this work, we accelerate SGD
by adaptively learning how to sample the most useful training examples at each
time step. First, we show that SGD can be used to learn the best possible
sampling distribution of an importance sampling estimator. Second, we show
that the sampling distribution of an SGD algorithm can be estimated online by
incrementally minimizing the variance of the gradient. The resulting
algorithm, called Adaptive Weighted SGD (AW-SGD), maintains a set of
parameters to optimize, as well as a set of parameters to sample learning
examples. We show that AW-SGD yields faster convergence in three different
applications: (i) image classification with deep features, where the sampling
of images depends on their labels, (ii) matrix factorization, where rows and
columns are not sampled uniformly, and (iii) reinforcement learning, where the
optimized and exploration policies are estimated at the same time and our
approach corresponds to an off-policy gradient algorithm.


1 Introduction 

In many real-world problems, one has to face intractable integrals, such as
averaging on combinatorial spaces or non-Gaussian integrals. Stochastic
approximation is a class of methods introduced in 1951 by Herbert Robbins and
Sutton Monro to solve intractable equations by using a sequence of approximate
and random evaluations. Stochastic Gradient Descent is a special type of
stochastic approximation method that is widely used in large scale learning
tasks thanks to its good generalization properties.

Stochastic Gradient Descent (SGD) can be used to minimize functions of the
form:

$$\gamma(w) := \mathbb{E}_{x \sim P}[f(x; w)] = \int_{\mathcal{X}} f(x; w)\, dP(x) \qquad (1)$$

where $P$ is a known fixed distribution and $f$ is a function that maps
$\mathcal{X} \times \mathcal{W}$ into $\mathbb{R}$, i.e. a family of functions
on the metric space $\mathcal{X}$ parameterized by $w \in \mathcal{W}$. SGD is
a stochastic approximation method that consists in doing approximate gradient
steps equal on average to the true gradient $\nabla_w \gamma(w)$. In many
applications, including supervised learning techniques, the function $f$ is
the log-likelihood and $P$ is an empirical distribution with density
$\frac{1}{n} \sum_{i=1}^{n} \delta(x, x_i)$, where $\{x_1, \dots, x_n\}$ is a
set of i.i.d. data sampled from an unknown distribution.

At a given step $t$, SGD can be viewed as a two-step procedure: (i) sampling
$x_t \in \mathcal{X}$ according to the distribution $P$; (ii) doing an
approximate gradient step with step-size $\rho_t$:

$$w_{t+1} = w_t - \rho_t \nabla_w f(x_t; w_t) \qquad (2)$$
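The two-step procedure of Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `sgd`, the toy objective $f(x; w) = \frac{1}{2}(w - x)^2$ and the $1/(t+1)$ step size are assumptions chosen so that the minimizer is simply the mean of the data.

```python
import numpy as np

def sgd(grad_f, data, w0, step_size, num_steps, seed=0):
    """Plain SGD on an empirical distribution (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for t in range(num_steps):
        x_t = data[rng.integers(len(data))]    # (i) sample x_t ~ P (uniform over the data)
        w = w - step_size(t) * grad_f(x_t, w)  # (ii) gradient step of Eq. (2)
    return w

# Toy objective (assumption): f(x; w) = 0.5 * (w - x)^2, so grad_w f = w - x
# and the expected loss is minimized at the mean of the data (3.0 here).
data = np.array([1.0, 2.0, 3.0, 6.0])
w_hat = sgd(lambda x, w: w - x, data, w0=0.0,
            step_size=lambda t: 1.0 / (t + 1), num_steps=5000)
```

With this particular step size the iterate is the running mean of the sampled points, which makes the link between the variance of the samples and the quality of the final estimate easy to see.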

The convergence properties of SGD are directly linked to the variance of the
gradient estimate [4]. Consequently, some improvements on this basic algorithm
focus on (i) parameter averaging, to reduce the variance of the final
estimator, (ii) the sampling of mini-batches, where multiple points are
sampled at the same time to reduce the variance of the gradient, and (iii) the
use of adaptive step sizes to have per-dimension learning rates, e.g.,
AdaGrad.

In this paper, we propose another general technique, which can be used in
conjunction with the aforementioned ones: reducing the gradient variance by
learning how to sample training points. Rather than learning the fixed optimal
sampling distribution and then optimizing the gradient, we propose to
dynamically learn an optimal sampling distribution at the same time as the
original SGD algorithm. Our formulation uses a stochastic process that focuses
on the minimization of the gradient variance, which amounts to doing an
additional SGD step (to minimize the gradient variance) along with each SGD
step (to minimize the learning objective). This adds a constant extra cost to
each iteration, but when simulations are expensive or data access is slow,
this extra computational cost is compensated by the increase in convergence
speed, as quantified in our experiments.

The paper is organized as follows. After reviewing the related work in
Section 2, we show that SGD can be used to find the optimal sampling
distribution of an importance sampling estimator (Sec. 3). This variance
reduction technique is then used during the iterations of an SGD algorithm by
learning how to reduce the variance of the gradient (Sec. 4). We then
illustrate this algorithm, called Adaptive Weighted SGD (AW-SGD), on three
well known machine learning problems: image classification (Sec. 5), matrix
factorization (Sec. 6), and reinforcement learning (Sec. 7). Finally, we
conclude with a discussion (Sec. 9).

2 Related work 

The idea of speeding up learning by modifying the importance sampling
distribution in SGD has been recently analyzed by [8], who showed that a
particular choice of the sampling distribution could lead to sub-linear
performance guarantees for support vector machines. We can see our approach as
a generalization of this idea to other models, by including the learning of
the sampling distribution as part of the optimization. Other work shows that
using a simple model to choose which data to resample from is useful, but it
does not learn the sampling model while optimizing. The two approaches
mentioned above can be viewed as the extreme case of adaptive sampling, where
there is one step to learn the sampling distribution, and then a second step
to learn the model using this sampling distribution. The training of language
models has been shown to be faster with adaptive importance sampling [10, 11],
but the authors did not directly minimize the variance of the estimator.

Regarding variance reduction techniques, in addition to the aforementioned
ones (Polyak-Ruppert averaging, batching, and adaptive learning rates like
AdaGrad), an additional technique is to use control variates. It has recently
been used to estimate non-conjugate potentials in a variational stochastic
gradient algorithm. The techniques described in this paper can also be
straightforwardly extended to the optimization of a control variate. A full
derivation is given in the appendix, but it was not implemented in the
experimental section. In the neural net community, adapting the order in which
the training samples are used is called curriculum learning, and our approach
can be seen under this framework, although our algorithm is more general as it
can speed up learning on arbitrary integrals, not only sums of losses over the
training data.

Another way to obtain good convergence properties is to properly scale or
rotate the gradient, ideally in the direction of the inverse Hessian, but this
type of second-order method is slow in practice. However, one can estimate the
Hessian greedily, as done in Quasi-Newton methods such as Limited Memory BFGS,
and then adapt it for the SGD algorithm, similarly to [6].

3 Adaptive Importance Sampling 

We first show in this section that SGD is a powerful tool to optimize the
sampling distribution of Monte Carlo estimators. This will motivate our
Adaptive Weighted SGD algorithm, in which the sampling distribution is not
kept constant, but learned during the optimization process.

We consider a family $\{Q_\tau\}$ of sampling distributions on $\mathcal{X}$,
such that $Q_\tau$ is absolutely continuous with respect to $P$ (i.e. the
support of $P$ is included in the support of $Q_\tau$) for any $\tau$ in the
parametric set $\mathcal{T}$. We denote the density
$q(\cdot\,; \tau) = dQ_\tau / dP$. Importance sampling is a common method to
estimate the integral in Eq. (1). It corresponds to a Monte Carlo estimator of
the form (we omit the dependency on $w$ for clarity):

$$\hat{\gamma} = \frac{1}{T} \sum_{t=1}^{T} \frac{f(x_t)}{q(x_t; \tau)}, \qquad x_t \sim Q_\tau \qquad (3)$$

where $Q_\tau$ is called the importance distribution. It is an unbiased
estimator of $\gamma$, i.e. the expectation of $\hat{\gamma}$ is exactly the
desired quantity $\gamma$.

To compare estimators, we can use a variance criterion. The variance of this
estimator depends on $\tau$:

$$\sigma^2(\tau) = \mathrm{Var}_\tau[\hat{\gamma}] = \frac{1}{T} \mathbb{E}_\tau\left[\left(\frac{f(x)}{q(x; \tau)}\right)^2\right] - \frac{\gamma^2}{T} \qquad (4)$$

where $\mathbb{E}_\tau[\cdot]$ and $\mathrm{Var}_\tau[\cdot]$ denote the
expectation and variance with respect to distribution $Q_\tau$.

To find the best possible sampling distribution in the sampling family
$\{Q_\tau\}$, one can minimize the variance $\sigma^2(\tau)$ with respect to
$\tau$. If $|f|$ belongs to the family $\{Q_\tau\}$, then there exists a
parameter $\tau^* \in \mathcal{T}$ such that $q(\cdot\,; \tau^*) \propto |f|$
$P$-almost surely. In such a case, the variance $\sigma^2(\tau^*)$ of the
estimator is null: one can estimate the integral with a single sample. In
general, however, the parametric family does not contain a normalized version
of $|f|$. In addition, the minimization of the variance often has no
closed-form solution. This motivates the use of approximate variance reduction
methods.

A possible approach is to minimize $\sigma^2$ with respect to the importance
parameter $\tau$. The gradient is:

$$\nabla_\tau \sigma^2(\tau) = -\frac{1}{T} \mathbb{E}_\tau\left[\frac{f^2(x)}{q^2(x; \tau)} \nabla_\tau \log q(x; \tau)\right] \qquad (5)$$

This quantity has no closed-form solution, but we can use an SGD algorithm
with a gradient step proportional on average to this quantity. To obtain an
estimator $g$ of the gradient with expectation given by Equation (5), up to
the constant factor $1/T$ that can be absorbed into the learning rate, it is
enough to sample a point $x_t$ according to $Q_\tau$ and then set
$g := -\frac{f^2(x_t)}{q^2(x_t; \tau)} \nabla_\tau \log q(x_t; \tau)$. This is
then repeated until convergence. The full iterative procedure is summarized in
Algorithm 1.

Algorithm 1 Minimal Variance Importance Sampling
Require: Initial sampling parameter vector $\tau_0 \in \mathcal{T}$
Require: Learning rates $\{\eta_t\}_{t \ge 0}$
for $t = 0, 1, 2, \dots, T-1$ do
    $x_t \sim Q_{\tau_t}$
    $\tau_{t+1} \leftarrow \tau_t + \eta_t \frac{f^2(x_t)}{q^2(x_t; \tau_t)} \nabla_\tau \log q(x_t; \tau_t)$
end for
Output $\hat{\gamma} \leftarrow \frac{1}{T} \sum_t \frac{f(x_t)}{q(x_t; \tau_t)}$
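The iteration of Algorithm 1 can be sketched as follows for a finite sample space. Everything specific here is an assumption made for illustration: $P$ is taken uniform over $k$ points, the family $\{Q_\tau\}$ is a softmax over those points (so that $\nabla_\tau \log q = e_x - \mathrm{softmax}(\tau)$), and the names `min_variance_is` and `f_vals` are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def min_variance_is(f_vals, eta, num_steps, seed=0):
    """Sketch of Algorithm 1 on k points with uniform P and softmax Q_tau."""
    rng = np.random.default_rng(seed)
    k = len(f_vals)
    tau = np.zeros(k)                      # tau_0 = 0 gives uniform sampling
    total = 0.0
    for _ in range(num_steps):
        probs = softmax(tau)
        x = rng.choice(k, p=probs)         # x_t ~ Q_{tau_t}
        q = k * probs[x]                   # density dQ/dP for uniform P
        ratio = f_vals[x] / q
        total += ratio                     # running sum for the estimator
        grad_log_q = -probs                # grad_tau log q = e_x - softmax(tau)
        grad_log_q[x] += 1.0
        tau = tau + eta * ratio**2 * grad_log_q
    return total / num_steps, softmax(tau)

f_vals = np.array([1.0, 1.0, 1.0, 9.0])   # true gamma = mean(f_vals) = 3
estimate, learned_probs = min_variance_is(f_vals, eta=0.002, num_steps=20000)
```

In this toy setting the zero-variance distribution is proportional to `f_vals`, so the learned probabilities should concentrate on the last point while the estimate stays close to the true mean.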

In the experiments below, we show that learning the importance weight of an
importance sampling estimator using SGD can lead to a significant speed-up in
several machine learning applications, including the estimation of empirical
loss functions and the evaluation of a policy in a reinforcement learning
scenario. In the following, we show that this idea can also be used in a
sequential setting (the function $f$ can change over time), and when $f$ has
multivariate outputs, so that we can control the variance of the gradient of a
standard SGD algorithm and, ultimately, speed up the convergence.

4 Biased Sampling in Stochastic Optimization

In this section, we first analyze a weighted version of the SGD algorithm
where points are sampled non-uniformly, similarly to importance sampling, and
then derive an adaptive version of this algorithm, where the sampling
distribution evolves with the iterations.
4.1 Weighted stochastic gradient descent

As introduced previously, our goal is to minimize the expectation of a
parametric function $f$ (cf. Eq. (1)). Similarly to importance sampling, we do
not need to sample according to the base distribution $P$ at each iteration of
SGD. Instead, we can use any distribution $Q$ defined on $\mathcal{X}$ if each
gradient step is properly re-weighted by the density $q = dQ/dP$. Each
iteration $t$ of the algorithm consists of two steps: (i) sample
$x_t \in \mathcal{X}$ according to distribution $Q$; (ii) do an approximate
gradient step:

$$w_{t+1} = w_t - \rho_t \frac{\nabla_w f(x_t; w_t)}{q(x_t)}$$
Depending on the importance distribution $Q$, this algorithm can have
different convergence properties from the original SGD algorithm. As mentioned
previously, the best sampling distribution would be the one that gives a small
variance to the weighted gradient in the step above. The main issue is that it
depends on the parameters $w_t$, which are different at each iteration.

Our main observation is that we can minimize the variance of the gradient
using the previous iterates, under the assumption that this variance does not
change too quickly when $w_t$ is updated. We argue this is reasonable in
practice, as learning rate policies for $\rho_t$ usually assume a small
constant learning rate or a decreasing schedule. In the next section, we use
that observation to build a new algorithm that learns the best sampling
distribution $Q$ in an online fashion.
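The effect of the re-weighting by $q$ can be checked exactly on a small finite example: the expected weighted gradient does not depend on $Q$, while the trace of its covariance does. The sketch below assumes a uniform base distribution $P$ over $n$ points; the gradient values and the helper names are made up for illustration.

```python
import numpy as np

def expected_weighted_gradient(grads, q_probs):
    """Exact E_{x~Q}[grad f(x) / q(x)], with q = dQ/dP and P uniform over n points."""
    n = len(grads)
    d = grads / (n * q_probs)[:, None]      # re-weighted gradients, one row per point
    return q_probs @ d

def weighted_gradient_variance(grads, q_probs):
    """Trace of the covariance of the re-weighted gradient under Q."""
    n = len(grads)
    d = grads / (n * q_probs)[:, None]
    second_moment = np.sum(q_probs * np.sum(d**2, axis=1))
    mean = grads.mean(axis=0)
    return second_moment - mean @ mean

grads = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, -1.0]])   # toy per-example gradients
uniform = np.full(3, 1.0 / 3.0)
norms = np.linalg.norm(grads, axis=1)
by_norm = norms / norms.sum()                              # Q proportional to gradient norm

mean_uniform = expected_weighted_gradient(grads, uniform)
mean_by_norm = expected_weighted_gradient(grads, by_norm)
var_uniform = weighted_gradient_variance(grads, uniform)
var_by_norm = weighted_gradient_variance(grads, by_norm)
```

Both means equal the plain average gradient, and sampling proportionally to the gradient norm gives the smaller variance, which is the quantity the adaptive algorithm of the next section tries to reduce online.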


4.2 Adaptive weighted stochastic gradient descent 

Similarly to Section 3, we consider a family $\{Q_\tau\}$ of sampling
distributions parameterized by $\tau$ in the parametric set $\mathcal{T}$.
Using the sampling distribution $Q_\tau$ with p.d.f.
$q(x; \tau) = \frac{dQ_\tau(x)}{dP(x)}$, we can now evaluate the efficiency of
the sampling distributions $Q_\tau$ based on the variance $\Sigma(w, \tau)$:

$$\Sigma(w, \tau) := \mathrm{Var}_\tau\left[\frac{\nabla_w f(x; w)}{q(x; \tau)}\right] \qquad (8)$$

$$= \mathbb{E}_\tau\left[\frac{\nabla_w f(x; w)\, \nabla_w f(x; w)^\top}{q(x; \tau)^2}\right] - \nabla_w \gamma(w)\, \nabla_w \gamma(w)^\top \qquad (9)$$

For a given function $f(\cdot\,; w)$, we would like to find the parameter
$\tau^*(w)$ of the sampling distribution that minimizes the trace of the
covariance $\Sigma(w, \tau)$, i.e.:

$$\tau^*(w) \in \arg\min_\tau \mathbb{E}_\tau\left[\left\| \frac{\nabla_w f(x; w)}{q(x; \tau)} \right\|^2\right] \qquad (10)$$

Note that if the family of sampling distributions $\{Q_\tau\}$ belongs to the
exponential family, problem (10) is convex and can therefore be solved using
(sub-)gradient methods.

Figure 1: Generalization performance (test mean Average Precision) and
training error (average log loss) as a function of training time (in seconds),
averaged over three independent runs. SGD converged in 45 epochs (outside the
graph), whereas AW-SGD converged to the same performance in 10 times fewer
epochs, for a 982% improvement in training time.

Algorithm 2 Adaptive Weighted SGD (AW-SGD)
Require: Initial target and sampling parameter vectors $w_0 \in \mathcal{W}$ and $\tau_0 \in \mathcal{T}$
Require: Learning rates $\{\rho_t\}_{t \ge 0}$ and $\{\eta_t\}_{t \ge 0}$
for $t = 0, 1, \dots, T-1$ do
    $x_t \sim Q_{\tau_t}$
    $d_t \leftarrow \frac{\nabla_w f(x_t; w_t)}{q(x_t; \tau_t)}$
    $w_{t+1} \leftarrow w_t - \rho_t d_t$
    $\tau_{t+1} \leftarrow \tau_t + \eta_t \|d_t\|^2 \nabla_\tau \log q(x_t; \tau_t)$
end for
Consequently, a simple SGD algorithm with gradient steps having small variance
consists of the following two steps at each iteration $t$:

1. perform a weighted stochastic gradient step using distribution $Q_{\tau_t}$
   to obtain $w_t$;

2. compute $\tau_t = \tau^*(w_t)$ by solving Equation (10), i.e. find the
   parameter $\tau_t$ minimizing the variance of the gradient at point $w_t$.
   This can be done approximately by applying $M$ steps of stochastic gradient
   descent.

The inner-loop SGD algorithm involved in the second step can be based on the
current sample, and the stochastic gradient direction is

$$\nabla_\tau \operatorname{tr} \Sigma(w_t, \tau) = \nabla_\tau \mathbb{E}_\tau\left[\left\| \frac{\nabla_w f(x; w_t)}{q(x; \tau)} \right\|^2\right] = -\mathbb{E}_\tau\left[\left\| \frac{\nabla_w f(x; w_t)}{q(x; \tau)} \right\|^2 \nabla_\tau \log q(x; \tau)\right] \qquad (11)$$

In practice, we noted that it is enough to do a single step of the inner loop,
i.e. $M = 1$. We call this simplified algorithm the Adaptive Weighted SGD
algorithm, and its pseudocode is given in Algorithm 2. We see that AW-SGD is a
slight modification of standard SGD (or any variant of it, such as AdaGrad,
AdaDelta or RMSProp) in which the sampling distribution evolves during the
algorithm, thanks to the update of $\tau_t$. This algorithm is useful when the
gradient has a variance that can be significantly reduced by choosing better
samples. An important design choice of the algorithm is the decay of the step
size sequences $\{\rho_t\}_{t \ge 0}$ and $\{\eta_t\}_{t \ge 0}$. While using
adaptive step sizes appears to be useful in some settings, it appears that the
regime in which AW-SGD outperforms SGD is when $\eta_t$ are significantly
larger than $\rho_t$, meaning that the algorithm converges quickly to the
smallest variance, and AW-SGD tracks it during the course of the iterations.
Ideally, the sequence of sampling parameters $\{\tau_t\}$ remains close to the
optimal trajectory, which consists of the best possible sequence of sampling
parameters given by Equation (10).

We now illustrate the benefit of this algorithm in three different
applications: image classification, matrix factorization and reinforcement
learning.
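Before turning to the applications, the AW-SGD loop of Algorithm 2 can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' code: the base distribution $P$ is uniform over $n$ training indices, the sampling family is a softmax over those indices, the step sizes are constant, and the toy least-squares problem and all names (`aw_sgd`, `grad_f`, `A`, `b`) are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def aw_sgd(grad_f, n, w0, rho, eta, num_steps, seed=0):
    """Sketch of Algorithm 2 with uniform P over n indices and softmax Q_tau."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    tau = np.zeros(n)                       # start from uniform sampling
    for _ in range(num_steps):
        probs = softmax(tau)
        i = rng.choice(n, p=probs)          # x_t ~ Q_{tau_t}
        q = n * probs[i]                    # density dQ/dP for uniform P
        d = grad_f(i, w) / q                # re-weighted gradient d_t
        w = w - rho * d                     # model update
        grad_log_q = -probs                 # grad_tau log q = e_i - softmax(tau)
        grad_log_q[i] += 1.0
        tau = tau + eta * np.dot(d, d) * grad_log_q   # sampling update
    return w, softmax(tau)

# Toy problem (assumption): f(i; w) = 0.5 * (a_i . w - b_i)^2 with consistent targets.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [3.0, -3.0]])
w_true = np.array([1.0, 2.0])
b = A @ w_true
w_hat, final_probs = aw_sgd(lambda i, w: (A[i] @ w - b[i]) * A[i],
                            n=len(A), w0=np.zeros(2),
                            rho=0.02, eta=1e-3, num_steps=3000)
```

The only difference from a plain SGD loop is the division by `q` and the extra update of `tau`, which reuses the squared norm of the gradient already computed for the model update.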
5 Application to Image classification

Large scale image classification is an important machine learning task where
images containing a given category are much less frequent than images not
containing the category. In practice, to learn efficient classifiers, one
needs to optimize a class-imbalance hyper-parameter. Furthermore, as suggested
by the standard practice of “hard negative mining”, positives and negatives
should have a different importance during optimization, with positives being
more important at first, and negatives gradually gaining in importance.
However, cross-validating the best imbalance hyper-parameter at each iteration
is prohibitively expensive. Instead, we show here that AW-SGD can be used for
biased sampling depending on the label, where the bias $\tau_t$ (imbalance
factor) is adapted along the learning.

To measure the acceleration of convergence, we experiment on the widely used
Pascal VOC 2007 image classification benchmark. Following standard practice,
we learn a One-versus-Rest logistic regression classifier using deep image
features from the last layers of the pre-trained AlexNet Convolutional
Network. Note that this image classification pipeline provides a strong
baseline, comparable to the state of the art.

Let $\mathcal{D} = \{(I_i, y_i),\ i = 1, \dots, n\}$ be a training set of $n$
images $I_i$ with labels $y_i \in \{-1, 1\}$. The discrete distribution over
samples is parametrized by the log-odds $\tau$ of the probability of sampling
a positive image: the family of sampling distributions $\{Q_\tau\}$ over
$\mathcal{D}$ can be written as:

$$q(x; \tau) = \frac{n}{n(y_i)}\, \varsigma(y_i \tau) \qquad (12)$$

with $\varsigma(a) := 1/(1 + e^{-a})$ representing the sigmoid function¹,
$x = i$ an image index in $\{1, \dots, n\}$, $\tau \in \mathbb{R}$, and
$n(+1)$ (resp. $n(-1)$) the number of positive (resp. negative) images. With
this formulation, the update equations in AW-SGD (Algorithm 2) are:

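$$f(x_t; w_t) = \ell\big(f(\phi_\theta(I_{i_t}); w_t),\, y_{i_t}\big)$$

$$= \log\big(1 + \exp(-y_{i_t} w_t^\top \phi_\theta(I_{i_t}))\big) \qquad (13)$$

$$= -\log(s_t) \qquad (14)$$

with $s_t := \varsigma(y_{i_t} w_t^\top \phi_\theta(I_{i_t}))$ representing
the predicted probability and $\theta$ the parameters of the feature function.

$$\nabla_w f(x_t; w_t) = (s_t - 1)\, y_{i_t}\, \phi_\theta(I_{i_t}),$$

$$\nabla_\tau \log q(x_t; \tau_t) = y_{i_t} \big(1 - \varsigma(y_{i_t} \tau_t)\big). \qquad (15)$$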
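The quantities in Eqs. (12) to (15) translate directly into code. The sketch below is only an illustration of those formulas: the helper names are hypothetical, `phi` stands for a precomputed feature vector $\phi_\theta(I_i)$, and the four-image label vector is made-up toy data.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sampling_probs(labels, tau):
    """Q_tau(i) = sigmoid(y_i * tau) / n(y_i); Eq. (12) is n times this probability."""
    labels = np.asarray(labels)
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == -1)
    counts = np.where(labels == 1, n_pos, n_neg)
    return sigmoid(labels * tau) / counts

def logistic_grad(w, phi, y):
    """Gradient of log(1 + exp(-y * w.phi)) with respect to w."""
    s = sigmoid(y * (w @ phi))              # predicted probability s_t
    return (s - 1.0) * y * phi

def grad_log_q(y, tau):
    """Derivative of log q with respect to tau, Eq. (15)."""
    return y * (1.0 - sigmoid(y * tau))

labels = np.array([1, -1, -1, -1])          # toy data: one positive, three negatives
probs_at_zero = sampling_probs(labels, tau=0.0)
```

With `tau = 0` the positive image is drawn half of the time and each negative one sixth of the time, and decreasing `tau` shifts the sampling mass towards the negatives.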

We initialize the positive sampling bias parameter with the value
$\tau_0 = 0.0$, which yields good performance for both SGD and AW-SGD. For
both the SGD baseline and our AW-SGD algorithm, we use AdaGrad to choose the
learning rates $\rho_t$ and $\eta_t$. Both were initialized at 0.1.

Figure 1 shows that AW-SGD converges faster than SGD for both training error
and generalization performance. The acceleration is both in time and in
iterations, and AW-SGD only costs +1.7% per iteration with respect to SGD in
our implementation. In further experiments, we noticed that the positive
sampling bias parameter $\tau_t$ indeed gradually decreases, i.e. the
algorithm learns that it should focus more on the harder negative class. We
also show that the values learned for this sampling parameter depend on the
category.

¹ Using the sigmoid link enables an optimization on the real line instead of
on the constrained set [0, 1].


[Figure 2 plot: positive sampling bias parameter $\tau_t$ (vertical axis,
ticks from -1.2 to 0.4) against iteration $t$ (horizontal axis, 0 to 70000),
with one curve per category: aeroplane, bicycle, bird, boat, bottle, bus, car,
cat, chair, cow, diningtable, dog, horse, motorbike, person, sheep, sofa,
train, tvmonitor.]

Figure 2: Evolution of the positive sampling bias parameter $\tau_t$ as a
function of the training iteration $t$ for the different object categories of
Pascal VOC 2007.


Figure 2 displays the evolution of the positive sampling bias parameter
$\tau_t$ along AW-SGD iterations $t$. Almost all classes exhibit the expected
behavior of sampling more and more negatives as the optimization progresses,
as the negatives correspond to anything but the object of interest, and are,
therefore, much more varied and difficult to model. The “person” class is the
only exception, because it is, by far, the category with the largest number of
positives and intra-class variation. Note that, although the dynamics of the
$\tau_t$ stochastic process are similar, the exact values obtained vary
significantly depending on the class, which shows the self-tuning capacity of
AW-SGD.

6 Application to Matrix factorization 

We applied AW-SGD to learn how to sample the rows and columns in an SGD-based
low-rank matrix decomposition algorithm. Let $Y \in \mathbb{R}^{n \times m}$
be a matrix that has been generated by a rank-$K$ matrix $UV^\top$, where
$U \in \mathbb{R}^{n \times K}$ and $V \in \mathbb{R}^{m \times K}$. We
consider a differentiable loss function $\ell(z; y)$ where $z \in \mathbb{R}$
and $y$ is the observed value. With the squared loss, each entry of $Y$ is a
real scalar and $\ell(z, y) = (z - y)^2$. The full loss function is

$$\gamma(U, V) = \sum_{i=1}^{n} \sum_{j=1}^{m} \ell(u_i v_j^\top, y_{ij}) \qquad (16)$$








Figure 3: Results of the Minimal Variance Importance Sampling algorithm
(Algorithm 1). The curve shows the standard deviation of the estimator of the
loss $\gamma$ as a function of the number of matrix entries that have been
observed.

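We consider the sampling distributions $\{Q_\tau\}$ over the set
$\mathcal{X} := \{1, \dots, n\} \times \{1, \dots, m\}$, where we
independently sample a row $i$ and a column $j$ according to the discrete
distributions $\varsigma(\tau')$ and $\varsigma(\tau'')$ respectively, with
$\tau' \in \mathbb{R}^n$, $\tau'' \in \mathbb{R}^m$,
$\tau = (\tau', \tau'') \in \mathbb{R}^{n+m}$ and $x = (i, j)$. We define:

$$\varsigma(z) = \frac{(e^{z_1}, e^{z_2}, \dots, e^{z_p})}{\sum_{k=1}^{p} e^{z_k}} \qquad (17)$$

$$q(x; \tau) \propto \varsigma_i(\tau')\, \varsigma_j(\tau'') \qquad (18)$$

with $\varsigma$ the softmax function. Using the square loss, as in the
experiments below, the update equations in AW-SGD (Algorithm 2) are:

$$f(x_t; u_t, v_t) = \ell(u_{i_t} v_{j_t}^\top, y_{i_t j_t}) = (u_{i_t} v_{j_t}^\top - y_{i_t j_t})^2 = s_t^2 \qquad (19)$$

$$\nabla_{u_i} f(x_t; u_t, v_t) = 2 v_{j_t} s_t \qquad (20)$$

$$\nabla_{v_j} f(x_t; u_t, v_t) = 2 u_{i_t} s_t \qquad (21)$$

$$\nabla_{\tau'} \log q(x_t; \tau_t) = e_i - \varsigma(\tau'), \qquad (22)$$

$$\nabla_{\tau''} \log q(x_t; \tau_t) = e_j - \varsigma(\tau'') \qquad (23)$$

where $s_t := u_{i_t} v_{j_t}^\top - y_{i_t j_t}$ is the residual, and
$e_i \in \mathbb{R}^n$ and $e_j \in \mathbb{R}^m$ are vectors with 1 at index
$i$ and $j$ respectively, and all other components equal to 0.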
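The per-sample quantities of Eqs. (19) to (23) can be computed as in the following sketch. It is a plain transcription of the formulas for one sampled entry; the function name `mf_sample_and_grads` and the small random matrices are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mf_sample_and_grads(U, V, Y, tau_row, tau_col, rng):
    """Sample an entry (i, j) and return the loss and gradients of Eqs. (19)-(23)."""
    p_row, p_col = softmax(tau_row), softmax(tau_col)
    i = rng.choice(len(p_row), p=p_row)     # row index drawn from softmax(tau')
    j = rng.choice(len(p_col), p=p_col)     # column index drawn from softmax(tau'')
    s = U[i] @ V[j] - Y[i, j]               # residual s_t
    grad_u = 2.0 * V[j] * s                 # Eq. (20)
    grad_v = 2.0 * U[i] * s                 # Eq. (21)
    g_row = -p_row                          # Eq. (22): e_i - softmax(tau')
    g_row[i] += 1.0
    g_col = -p_col                          # Eq. (23): e_j - softmax(tau'')
    g_col[j] += 1.0
    return (i, j), s**2, grad_u, grad_v, g_row, g_col

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 2))
V = rng.normal(size=(3, 2))
Y = U @ V.T + 1.0                           # toy targets with a constant offset
sample = mf_sample_and_grads(U, V, Y, np.zeros(4), np.zeros(3), rng)
```

Only row `i` of `U`, row `j` of `V` and the two sampling vectors receive a non-zero update for a given sample, and both sampling gradients sum to zero because the softmax probabilities sum to one.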

In the matrix factorization experiments, we used the mini-batch technique with
batches of size 100. The initial learning rates $\rho_0$ and $\eta_0$ were
tuned to yield the minimum $\gamma$ at convergence, separately for each
algorithm. All results are averaged over 10 runs. $\tau'$ and $\tau''$ were
initialized with zeros to get an initial uniform sampling distribution over
the rows and columns. The model learning rate decrease was set to
$\rho_0 / ((N/2)$ [...] was kept constant.

Figure 4: Comparison of the convergence speed of the AW-SGD algorithm and the
standard SGD algorithm (uniform sampling of rows and columns) on the matrix
factorization experiment.

For Figure 3, we simulated an $n \times m$ rank-$K$ matrix, for
$n = m = 100$ and $K = 10$, by sampling $U$ and $V$ using independent centered
Gaussian variables with unit variance. To illustrate the benefit of adaptive,
non-uniform sampling, we multiply a randomly drawn square block of size 20 by
100. The results of the minimal variance importance sampling scheme
(Algorithm 1) are shown on the left. We see that after having seen 50% of the
number $N = nm$ of matrix entries, the standard deviation of the importance
sampling estimator is divided by two, meaning that we would need only half of
the samples to evaluate the full loss compared to uniform sampling. Figure 4
shows the loss decrease of SGD and AW-SGD on the same matrix for multiple
learning rates. The x-axis is expressed in epochs, where one epoch corresponds
to $N$ sampled values in the matrix. AW-SGD converges significantly faster
than the best uniformly sampled SGD, even after 1 epoch through the data. On
average, AW-SGD requires half of the number of iterations to converge to the
same value.

In-painting experiment. We compared both algorithms on the MNIST dataset [22],
on which low-rank decomposition techniques have been successfully applied. We
factorized with $K = 50$ the training set for the zero digit, a
5923 × 784 matrix, where each line is a 28 × 28 image of a handwritten zero,
and each column one pixel. Figure 5 shows the loss decrease for both
algorithms on the first iteration. AW-SGD requires significantly fewer samples
to reach the same error. At convergence, AW-SGD showed an average 2.52×
speedup in execution time compared to SGD, showing that its sampling choices
compensate for its parametrization overhead.

Figure 5: Evolution of the training error as a function of the number of
epochs for the uniformly-sampled SGD and AW-SGD for the matrix factorization
application applied on MNIST data.

Non-stationary data. In Figure 6, we progressively substituted images of
handwritten zeros by images of handwritten ones. It shows, every 2000 samples
(i.e. 0.0005 epoch), the heatmap of the sampling probability of each pixel,
reorganized as 28 × 28 grids. The substitution from zeros to ones was made
between 10000 and 20000 samples (on the 2nd line). One can distinctly
recognize the zero digit first, which progressively fades out in favor of the
one digit². This transition shows that AW-SGD learns to sample the pixels that
are likely to have a high impact on the loss. The algorithm adapts online to
changes in the underlying distribution (transitions from one digit to
another).

When combining AW-SGD with adaptive step size algorithms such as AdaGrad, we
noticed that AdaGrad did not improve the convergence speed of AW-SGD in our
matrix factorization experiments. A possible explanation is that the adaptive
sampling favors some rows and columns, and AdaGrad compensates for the
non-uniform sampling, such that using AW-SGD and AdaGrad simultaneously
converges only slightly faster than AdaGrad alone. It should behave similarly
on other parameterizations of $\tau$ where the indices of $\tau$ are linked to
parameter indices. However, in many of our experiments, the performance of
AdaGrad did not match that of the best cross-validated learning rates.

7 Sequential Control through Policy Gradient

Stochastic optimization is currently one of the most popular approaches for
policy learning in the context of Markov Decision Processes. More precisely,
policy gradient has become the method of choice in a large number of contexts
in reinforcement learning. Here, optimizing the integral (1) is related to
policy gradient algorithms, which aim at minimizing an expected loss (i.e. a
negative reward or a cost) or maximizing a reward in an episodic setting (i.e.
with a predefined finite trajectory length), and to off-policy estimation.
Equivalently, if we consider the sampling space as being the (action, state)
trajectory of a Markov Decision Process, AW-SGD can be viewed as an off-policy
gradient algorithm, where $P_w$ and $Q_\tau$ have the same parameterization,
i.e. $\mathcal{W} = \mathcal{T}$. The objective is to maximize the expected
reward for the target policy $P_w$, and to minimize the variance of the policy
gradient for the exploration policy $Q_\tau$.

² We created an animated gif with more of these images and inserted it in the
supplementary material.

We considered a canonical grid-world problem with a square grid of size
$\ell$. A classical reward setting has been applied: the reward function is a
discounted instantaneous reward of −1 assigned on each cell of the grid and a
reward of 1000 for a terminal state located at the bottom-right of the grid.
In this context, an episode is considered successful if the defined terminal
state is reached. Finally, $n_{\mathrm{trap}}$ terminal states with a negative
reward of −1000 are also positioned at random. The start state is located at
the top-left cell of the grid.

In this experimental setting, the parameters $w$ and $\tau$ of the target
policy $P_w$ and the exploration policy $Q_\tau$ are defined in the same
space. More precisely, the probability of an action $a$ at each position
$(x, y) \in [1, \ell]^2$ follows a multinomial distribution. In the grid
environment used in this section, its parameters correspond to the log-odds of
the probability of moving in one of the four directions at each position of
the grid (movements outside the grid do not change the position). The
distribution $Q_\tau$ of sampled trajectories is different from the
distribution of trajectories derived from $P_w$ (off-policy learning).

The policy is optimized using Algorithm 2. The baseline corresponds to a
policy iteration based on SGD where trajectories are sampled using the current
policy estimate (on-policy learning). Table 1 gives the means and variances
obtained for a batch of 20000 learning trials using both algorithms with
properly tuned learning rates (the optimal learning rate is different in the
two algorithms: $\rho = 2.1$ for SGD and $\rho = 0.003$ for AW-SGD). We can
see that for all the tested grid sizes, there is a significant improvement
(close to 10% relative improvement) of the expected success when adaptive
weighted SGD is used instead of the on-policy learning SGD algorithm.
Figure 6: Illustration of the evolution of the sampling distribution when data
are not i.i.d. Each heatmap contains the sampling probability of each pixel in
the MNIST matrix factorization experiment.


8 Adapting to Non-Uniform Architectures 

In many large scale infrastructures, such as computation servers shared by
many people, data access or gradient computation times are unknown in advance.
For example, in large scale image classification, some images might be stored
in RAM, leading to an access time that is orders of magnitude faster than for
other images stored only on the hard drive. These hardware systems are
sometimes called Non-Uniform Memory Access. It is also the case in matrix
factorization, when some embeddings are stored locally and others are
downloaded through the network. How can we inform the algorithm that it should
sample more often the data stored locally?

A simple modification of AW-SGD can enable the algorithm to adapt to
non-uniform computation time. The key idea is to learn dynamically to maximize
the expected loss decrease per unit of time. To make AW-SGD take this access
time into account, we simply weighted the update of $\tau$ in Algorithm 2 by
dividing it by the simulated access time $\Delta_x$ to the sample $x$. This is
summarized in Algorithm 3.

As an experiment, we show in the matrix factorization case that the time-aware
AW-SGD is able to learn and exploit the underlying hardware when the data does
not fit entirely in memory and part of it has an extra access cost. To do so,
we generate a 1000 × 1000 rank-10 matrix, but without a high-variance block,
so that the variance is uniform across rows and columns. For the first half of
the rows of the matrix, i.e. $i < n/2$, we consider the data as being in main
memory, and simulate an access cost of 100 ns for each sampling in those rows,
inspired by Jeff Dean’s explanations [29]. For the other half of the rows,
$i \ge n/2$, we multiply that access cost by a factor $f$, which we call the
slow block access factor. The simulated access time to the sample $(i, j)$,
$\Delta_{(i,j)}$, is thus given by: $\Delta_{(i,j)} = 10^{-7} \times f$ if
$i \ge n/2$ and $\Delta_{(i,j)} = 10^{-7}$ if $i < n/2$. We ranged the factor
$f$ from 2 to $2^{\ldots}$.

$\ell$     AW-SGD           SGD
15         0.91 ± 0.021     0.85 ± 0.032
50         0.85 ± 0.031     0.77 ± 0.042
80         0.81 ± 0.046     0.74 ± 0.056

Table 1: Probability of success, i.e. reaching the end point, for various
environment sizes.

Algorithm 3 Time-Aware AW-SGD
Require: Initial values for $w_0 \in \mathcal{W}$ and $\tau_0 \in \mathcal{T}$
Require: Learning rates $\{\rho_t\}_{t \ge 0}$ and $\{\eta_t\}_{t \ge 0}$
for $t = 0, 1, \dots, T-1$ do
    $s_t \leftarrow$ getCurrentTime()
    $x_t \sim Q_{\tau_t}$
    $d_t \leftarrow \frac{\nabla_w f(x_t; w_t)}{q(x_t; \tau_t)}$
    $w_{t+1} \leftarrow w_t - \rho_t d_t$
    $e_t \leftarrow$ getCurrentTime()
    $\tau_{t+1} \leftarrow \tau_t + \eta_t \frac{\|d_t\|^2}{e_t - s_t} \nabla_\tau \log q(x_t; \tau_t)$
end for
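The time-aware modification only changes the update of $\tau$, which can be sketched as below. This is an illustrative sketch rather than the authors' code: the sampling family is assumed to be a softmax over indices, the elapsed time is passed in as a number instead of being measured with a clock, and the helper names are hypothetical.

```python
import numpy as np

def simulated_access_time(i, n, f, base=1e-7):
    """Simulated access time in seconds: 100 ns for rows i < n/2, f times more otherwise."""
    return base * f if i >= n / 2 else base

def time_aware_tau_update(tau, probs, i, d, eta, elapsed):
    """Sampling update of Algorithm 2 divided by the elapsed time (softmax Q_tau assumed)."""
    grad_log_q = -np.asarray(probs, dtype=float)   # e_i - softmax(tau)
    grad_log_q[i] += 1.0
    return tau + eta * (np.dot(d, d) / elapsed) * grad_log_q

# Same gradient norm, but one sample sits in the fast block and one in the slow block.
n, f = 4, 100.0
tau = np.zeros(n)
probs = np.full(n, 1.0 / n)
d = np.array([1.0, 0.0])
fast = time_aware_tau_update(tau, probs, 0, d, eta=1e-9,
                             elapsed=simulated_access_time(0, n, f))
slow = time_aware_tau_update(tau, probs, 3, d, eta=1e-9,
                             elapsed=simulated_access_time(3, n, f))
```

For equal gradient norms, the push towards a slow sample is `f` times smaller than the push towards a fast one, which is how the sampler comes to prefer cheap data.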

The time speedup achieved by the time-aware AW-SGD against SGD is plotted
against the evolution of this factor in Figure 7. For each algorithm, we
summed the real execution time and the simulated access times in order to take
into account the time-aware AW-SGD sampling overhead. The speedup is computed
after one epoch, by dividing the SGD total time by the AW-SGD total time.
Positive speedups start with a slow access time factor $f$ of roughly 200,
which corresponds to a random read on an SSD. Below this value, AW-SGD is
slower, since the data is homogeneous and the access time difference is not
yet big enough to compensate for its overhead. At $f = 5000$, corresponding to
a read from another computer’s memory on a local network, the speedup reaches
10×. At $f = 50000$, a hard drive seek, AW-SGD is 100× faster. This shows that
the time-aware AW-SGD overhead is compensated by its sampling choices.

Figure 8 shows the loss decrease of both algorithms over the first 5 epochs
with $f = 5000$. It shows that if the access time were uniform, AW-SGD would
have the same convergence speed as standard SGD (this is expected by the
design of this experiment). Hence, even in such a case where there is no
theoretical benefit of using the time-aware AW-SGD in terms of epochs, the
fact that we learn the underlying access time to bias the sampling could
potentially lead to huge improvements in convergence time.












[Plot title: AW-SGD speedup on non-uniform time access matrix]

Figure 7: Evolution of the training error as a function of the number of
epochs on the simulated matrix with different access costs, with $f = 5000$,
for the uniformly-sampled SGD and AW-SGD using the best $\rho_0$ for each
algorithm.

9 Conclusion 

In this work, we argue that SGD and importance sampling can strongly benefit
from each other. SGD algorithms can be used to learn the minimal variance
sampling distribution, while importance sampling techniques can be used to
improve the gradient estimation of SGD algorithms. We have introduced a simple
yet efficient Adaptive Weighted SGD algorithm that can optimize a function
while optimizing the way it samples the examples. We showed that this
framework can be used in a large variety of problems, and experimented with it
in three domains that apparently have no direct connections: image
classification, matrix factorization and reinforcement learning. In all cases,
we can gain a significant speed-up by optimizing the way the samples are
generated.

There are many more applications in which these variance reduction techniques
have a strong potential. For example, in variational inference, the objective
function is an integral and SGD algorithms are often used to speed up
convergence. Computing these integrals stochastically could be made more
efficient by sampling non-uniformly in the integration space. Also, models
with an intractable log-partition function, such as Boltzmann machines, are
potential candidates: importance sampling has already been proposed to
estimate it, but without a variance reduction technique.

This work also shows that we can learn about the algorithm while optimizing,
as shown by the time-aware AW-SGD. This idea can be extended to design new
types of meta-algorithms that learn to optimize or learn to coach other
algorithms.



Figure 8: Evolution of the training error as a function of the number of
epochs on the simulated matrix with different access costs, with $f = 5000$,
for the uniformly-sampled SGD and AW-SGD using the best $\rho_0$ for each
algorithm.